Refactor Interaction and Better Testing #71

shenoynikhil · 2024-03-26T19:58:23Z

Fixes #63

💯 Shoutout for already existing code from Danny which made working on this super easy 💯

Dataset Preprocessing

Updated config-factory for L7, X40, Splinter and MetCalf and performed preliminary checks.
Changed the YAML-reading code for L7 and the read_raw_entries implementations for both L7 and X40.
Changed the Splinter None setting to -1 (@mcneela please help here if you know of any other place where you did this).
The MetCalf processing steps were missing, so added a function that extracts the raw files and then calls read_raw_entries().
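A minimal sketch of the None-to-sentinel replacement described for Splinter, assuming the raw values arrive as a Python list that may contain None. The helper name `replace_none` and the sample values are hypothetical and are not taken from the OpenQDC code.

```python
import numpy as np


def replace_none(values, fill=-1.0):
    """Return a float array in which every None in `values` is replaced by `fill`."""
    return np.array([fill if v is None else v for v in values], dtype=np.float64)


# Made-up raw values, for illustration only.
raw_energies = [0.5, None, -1.25]
clean = replace_none(raw_energies)
```

Passing `fill=np.nan` instead of `-1.0` gives the NaN variant that a later commit in this PR mentions.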

Simplified the Interaction Dataset class by using:

pkl_data_keys functionality: extended it in BaseInteractionDataset to enforce that n_atoms_first is present in the data_dict
Removed the x[x==None] = -1 step from save_preprocess by making the change in read_raw_entries instead. So far I have only seen Splinter do this.
Add dummy interaction dataset for testing
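The pkl_data_keys idea above can be sketched roughly as follows. The two classes are simplified stand-ins rather than the real OpenQDC classes, and `check_entry` is a hypothetical helper added only to show how a required key could be enforced.

```python
from typing import Dict, List


class BaseDataset:
    """Simplified stand-in for a base dataset class."""

    @property
    def pkl_data_keys(self) -> List[str]:
        # Keys every entry is expected to carry (assumed for this sketch).
        return ["name", "subset", "n_atoms"]

    def check_entry(self, data_dict: Dict) -> None:
        """Hypothetical helper: raise if any required key is absent."""
        missing = [key for key in self.pkl_data_keys if key not in data_dict]
        if missing:
            raise KeyError(f"missing keys: {missing}")


class BaseInteractionDataset(BaseDataset):
    """Interaction datasets additionally require n_atoms_first."""

    @property
    def pkl_data_keys(self) -> List[str]:
        # Extend the parent's list instead of redefining it.
        return super().pkl_data_keys + ["n_atoms_first"]
```

Because the subclass only extends the list, any check written against `pkl_data_keys` in the base class picks up the extra key without further changes.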

Testing

Add tests for DummyInteractionDataset

Feel free to drop suggestions @prtos @mcneela @FNTwin

Checklist:

Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
Update the API documentation if a new function is added or an existing one is deleted.

openqdc/datasets/base.py

openqdc/datasets/interaction/X40.py

openqdc/datasets/interaction/base.py

openqdc/datasets/interaction/dummy.py

openqdc/datasets/interaction/splinter.py

tests/test_interaction.py

FNTwin

Love the PR: it makes things cleaner and adds a way to fetch the raw files.

Besides some small comments, I think that we should refactor the DES classes. Currently we have 4 files for DES:
DES370K, DES5M(DES370K), DES66, DES66x8

I would move the DES5M and DES370K classes into a single DES file.

I would do the same with DES66 and DES66x8, moving them into a DES66 file and making one of them inherit from the other, so that we have a richer type that we could use if needed.

More importantly, I think there is an error in the DES5M read_raw_entries(self) method.
Currently the class is a child of DES370K:

    def read_raw_entries(self) -> List[Dict]:
        return DES5M._read_raw_entries()

This seems wrong, as it looks to me like it is recursively calling itself. It should be:

    def read_raw_entries(self) -> List[Dict]:
        return super()._read_raw_entries()

or

    def read_raw_entries(self) -> List[Dict]:
        return DES370K._read_raw_entries()
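The super() variant suggested above can be illustrated with a toy pair of classes. `ToyDES370K`, `ToyDES5M`, the `_filename` attribute and the file names are all hypothetical; the sketch only shows that delegating through super() runs the parent's helper while still picking up the child's class attributes.

```python
from typing import Dict, List


class ToyDES370K:
    # Hypothetical attribute naming the raw file for this dataset.
    _filename = "DES370K.csv"

    def _read_raw_entries(self) -> List[Dict]:
        # Shared parsing logic; here it only records which file would be read.
        return [{"file": self._filename}]

    def read_raw_entries(self) -> List[Dict]:
        return self._read_raw_entries()


class ToyDES5M(ToyDES370K):
    _filename = "DES5M.csv"

    def read_raw_entries(self) -> List[Dict]:
        # Delegate to the parent's helper; `self` is still a ToyDES5M instance.
        return super()._read_raw_entries()
```

With this layout the child does not strictly need to override `read_raw_entries` at all, which is one argument for keeping both classes in a single file.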

tests/test_dummy.py

openqdc/datasets/interaction/dummy.py

openqdc/datasets/interaction/base.py

openqdc/datasets/interaction/X40.py

FNTwin · 2024-04-06T15:00:06Z

I haven't tested it thoroughly yet. I'll try to run and debug the classes more on Sunday or Monday.
I still think that most of the read_raw_entries() functions are way too long and complex (read as: doing too many things). I'm not a fan of SOLID principles but ... 😂

mcneela

Left a couple minor comments, feel free to address those, but otherwise looks good I think!

openqdc/datasets/interaction/X40.py

openqdc/datasets/interaction/base.py

openqdc/datasets/interaction/splinter.py

Interaction dataset refactoring

FNTwin

The interaction dataset module is getting cleaner and I quite like it. I know there are some issues with naming and with restricting the data keys to property methods, but I think overall this kind of inheritance will help us along the way: the preprocess methods should hardly need to be touched at all, and if we do need to touch them, it means that we did something wrong at the beginning.

Nonetheless, there are some issues with changing the keys (plus some naming kerfuffles) that require reprocessing the datasets and re-pushing them to the cloud. Because of that, I'm currently unable to give you a more in-depth review, as my testing is quite limited.

Side note:

We should stop calling __name__ directly and provide a name property that sanitizes the name to avoid problems
We really need more solid tests
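A rough sketch of the name property suggested in the side note, assuming that "sanitizing" means lower-casing and replacing anything outside letters, digits and underscores. The classes and the exact rule are assumptions for illustration, not the OpenQDC implementation.

```python
import re


class ToyDataset:
    # Dunder attribute used as the raw dataset name, as in the comment above.
    __name__ = "L7"

    @property
    def name(self) -> str:
        """Lower-case, path-friendly version of __name__ (assumed rule)."""
        lowered = self.__name__.strip().lower()
        return re.sub(r"[^a-z0-9_]+", "_", lowered)


class ToyX40(ToyDataset):
    __name__ = "X40"
```

Code that builds bucket or cache paths could then use `dataset.name` everywhere, so that a mixed-case `__name__` cannot produce mismatched paths.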

As soon as we fix the main issues, I can do a round of testing for a final approval if you think it is needed. This would be a good time to actually clean up the push methods, as we are currently forced to always push to the cloud, which is not the intended behaviour.

FNTwin · 2024-04-13T14:12:04Z

openqdc/datasets/interaction/base.py

@@ -61,7 +35,7 @@ def __getitem__(self, idx: int):
        )
        name = self.__smiles_converter__(self.data["name"][idx])
        subset = self.data["subset"][idx]
-        n_atoms_first = self.data["n_atoms_first"][idx]
+        n_atoms_ptr = self.data["n_atoms_ptr"][idx]


All the datasets are currently using the n_atoms_first key that is produced in read_raw_entries. We need to recompute and push the data with the new key.

Yes, we need to recompute and push before merging these changes.

FNTwin · 2024-04-13T14:15:16Z

openqdc/datasets/interaction/base.py

Currently, trying to load any interaction dataset raises an error because the
if not self.is_preprocessed() check fails due to the naming.

In the bucket they were written as L7 and X40 (upper case). We should always use the sanitized name in lower case. Since we need to process the data again to get the new keys, this will fix itself.

Yup, need to push new changes.

FNTwin · 2024-04-13T14:34:08Z

openqdc/datasets/interaction/base.py

+            "name": str,
+            "subset": str,
+            "n_atoms": np.int32,
+            "n_atoms_ptr": np.int32,


The n_atoms_ptr key will make read_preprocess fail for the interaction datasets, because the stored data does not currently contain that key (see my other comments).

In more depth, line 393: assert all([key in all_pkl_keys for key in self.pkl_data_keys])

will give you:

['name', 'subset', 'n_atoms', 'n_atoms_ptr'] [True, True, True, False]
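The failing check can be reproduced in isolation. In this sketch the contents of the stored pickle are assumed (it still carries the old n_atoms_first key), and `find_missing_keys` is a hypothetical helper showing how the missing key could be reported by name instead of through a bare assert.

```python
from typing import Iterable, List


def find_missing_keys(required: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return the required keys that are absent from `available`, in order."""
    available_set = set(available)
    return [key for key in required if key not in available_set]


required = ["name", "subset", "n_atoms", "n_atoms_ptr"]
# Assumed keys of the already-pushed data, which predates the rename.
stored = ["name", "subset", "n_atoms", "n_atoms_first"]

present = [key in stored for key in required]
missing = find_missing_keys(required, stored)
```

Raising an error that lists `missing` would point directly at the renamed key rather than at a generic assertion failure.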

FNTwin · 2024-04-13T14:34:34Z

openqdc/datasets/interaction/base.py

@@ -78,68 +52,18 @@ def __getitem__(self, idx: int):
            name=name,
            subset=subset,
            forces=forces,
-            n_atoms_first=n_atoms_first,
+            n_atoms_ptr=n_atoms_ptr,


Same comment as before

FNTwin · 2024-04-13T14:35:22Z

openqdc/datasets/interaction/des.py

We need to change n_atoms_first to n_atoms_ptr, then preprocess it again and push.

FNTwin · 2024-04-13T14:35:57Z

openqdc/datasets/interaction/l7.py

We need to change n_atoms_first to n_atoms_ptr, then preprocess it again and push.

FNTwin · 2024-04-13T14:36:23Z

openqdc/datasets/interaction/metcalf.py

We need to change n_atoms_first to n_atoms_ptr, then preprocess it again and push.

openqdc/datasets/interaction/splinter.py

FNTwin · 2024-04-13T14:37:18Z

openqdc/datasets/interaction/x40.py

We need to change n_atoms_first to n_atoms_ptr, then preprocess it again and push.

FNTwin · 2024-04-13T15:14:11Z

This would be a good time to actually clean up the push methods, as we are currently forced to always push to the cloud, which is not the intended behaviour.

Addressed this just now in the qol branch, which also adds a new CLI method to easily preprocess and optionally push datasets.

https://github.com/OpenDrugDiscovery/openQDC/tree/qol
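To make the "preprocess and optionally push" idea concrete, here is a small argparse sketch. It is not the CLI from the qol branch: the program name `toy-preprocess`, the `--push` flag and the `preprocess`/`push` callables are all hypothetical and exist only to show pushing as an opt-in step.

```python
import argparse
from typing import Callable, List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="toy-preprocess")
    parser.add_argument("dataset", help="name of the dataset to preprocess")
    # Pushing is opt-in: nothing is uploaded unless the flag is given.
    parser.add_argument("--push", action="store_true", help="upload after preprocessing")
    return parser


def run(argv: List[str], preprocess: Callable[[str], None], push: Callable[[str], None]):
    """Preprocess the requested dataset and push it only when --push is set."""
    args = build_parser().parse_args(argv)
    preprocess(args.dataset)
    if args.push:
        push(args.dataset)
    return args
```

Injecting `preprocess` and `push` as callables keeps the sketch testable without touching any cloud storage.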

FNTwin · 2024-04-18T19:55:32Z

As I preprocessed and merged this branch into QoL, I'm going to close it.

refactor interaction and initial testing

31beb71

shenoynikhil changed the base branch from main to develop March 26, 2024 19:58

shenoynikhil changed the title ~~Better Testing~~ Refactor Interaction and Better Testing Mar 26, 2024

shenoynikhil commented Mar 26, 2024

openqdc/datasets/base.py (resolved)

Nikhil Shenoy added 2 commits March 26, 2024 20:25

minor changes

dccf676

dummy modification

2ab64aa

mcneela requested changes Mar 27, 2024

undo changes in interaction dataset, and minor change in shape

189ab90

shenoynikhil requested a review from mcneela March 29, 2024 01:33

changed super class to BaseInteractionDataset

282dc91

shenoynikhil changed the base branch from develop to release April 3, 2024 17:45

Nikhil Shenoy added 2 commits April 3, 2024 19:53

Merge branch 'release' into testing

701ef1e

further simplified and rebase

afea053

shenoynikhil commented Apr 3, 2024

openqdc/datasets/base.py (outdated, resolved)

mcneela reviewed Apr 4, 2024

openqdc/datasets/base.py (resolved)

mcneela reviewed Apr 4, 2024

openqdc/datasets/interaction/base.py (outdated, resolved)

mcneela reviewed Apr 4, 2024

openqdc/datasets/interaction/base.py (outdated, resolved)

mcneela reviewed Apr 4, 2024

openqdc/datasets/base.py (outdated, resolved)

mcneela reviewed Apr 4, 2024

openqdc/datasets/base.py (resolved)

shenoynikhil mentioned this pull request Apr 5, 2024

Bug Fixes and Component-wise Force Simplification #78

Merged

3 tasks

Nikhil Shenoy added 6 commits April 5, 2024 21:13

fixes

ebc2adf

Merge remote-tracking branch 'origin/release' into testing

7ffd0b1

Merge remote-tracking branch 'origin/release' into testing

d15e9cf

Updated metcalf

ed8e264

bug fix and simplifying interaction dataset

18bc79c

Updated tests for interaction datasets

2a6e3ef

shenoynikhil requested a review from mcneela April 6, 2024 01:15

shenoynikhil assigned prtos Apr 6, 2024

removed stale stats in dummy interaction

7493273

FNTwin reviewed Apr 6, 2024

Nikhil Shenoy and others added 14 commits April 6, 2024 16:39

changes based on comments

ed73e7d

Clean metcalf

0359022

Simplification

33fa342

cleaned des

cd486a8

Simplified des dataset

80d7371

removed redundant dataset files

f3d205c

DES inerithance

da4fece

Removed des and improved des naming

71ff741

DES fixes

f6e12e1

Removed comments

3328a65

X40 and L70

8b28d59

Safe opening

8595fd8

Moved X40 in L7 and removed x40.py

ca1b4af

Moved Yaml utils to _utils.py, L7 + X40 interface

4bec82d

mcneela approved these changes Apr 7, 2024

openqdc/datasets/interaction/X40.py (outdated, resolved)

openqdc/datasets/interaction/base.py (outdated, resolved)

openqdc/datasets/interaction/splinter.py (outdated, resolved)

FNTwin and others added 3 commits April 8, 2024 08:47

Merge testing + Add imports

a5ced0a

Merge pull request #79 from OpenDrugDiscovery/interaction_impr

a21963e

Interaction dataset refactoring

better convert function and n_body_first to ptr

3303f95

FNTwin reviewed Apr 13, 2024

Updated splinter reading from -1 to nan

6f033cf

FNTwin closed this Apr 18, 2024

FNTwin deleted the testing branch July 10, 2024 20:48
